library(tidyverse)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(forcats)
library(ggraph)
library(igraph)
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

A network (also sometimes referred to as a graph) is a set of relationships. Networks contain a set of objects (nodes or vertices) and a mapping or description of the relations between those objects (links or edges). For example, a simple network contains two objects, 1 and 2, and one relationship, or edge, that links them:
The above edge is an undirected edge - there is no directionality to the relationship between 1 and 2. A directed edge is an ordered pair of nodes, with an arrow drawn to indicate the directionality of the edge:
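As a minimal sketch of the two kinds of edge described above, the two-node network can be built with igraph. The object names g_undirected and g_directed are illustrative only.

```r
library(igraph)

# Undirected edge between nodes 1 and 2: the pair is unordered
g_undirected <- make_graph(edges = c(1, 2), directed = FALSE)

# Directed edge from node 1 to node 2: the pair is ordered
g_directed <- make_graph(edges = c(1, 2), directed = TRUE)

is_directed(g_undirected)
is_directed(g_directed)

# In the directed graph the edge exists in one direction only
are_adjacent(g_directed, 1, 2)
are_adjacent(g_directed, 2, 1)

# In the undirected graph the order of the pair does not matter
are_adjacent(g_undirected, 2, 1)
```

Calling plot() on either object draws the corresponding diagram, with an arrowhead on the directed edge.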
There are lots of ways to analyze and measure networks. Let’s first look at two examples of network analysis applied to real-life data, then circle back and talk about packages for network visualization and analysis in R.
The Rise of Partisanship in the U.S. House of Representatives
There is little argument that political polarization is occurring in the United States. Partisanship can be attributed to many causes, including:
In this paper, the authors study partisanship in the U.S. House of Representatives by examining relationships between legislators. On one hand, legislators are pressured by party leaders to vote with members of their own party, with incentives such as committee assignments and campaign funding used to keep members in check. On the other hand, legislators have individual incentives to cooperate with members of the opposite political party (for example, responsiveness to the concerns of their own constituencies).
Here, the authors measure the extent to which legislators form ideological relationships with members of the opposite party by examining cooperation rates between individual members of Congress on roll-call votes. In this research design, legislators are the nodes and the frequency of agreement on roll-call votes defines the edges. Edges are calculated for each pair of legislators serving in a two-year term of Congress, resulting in nearly 6,000,000 pairs of legislators. The data is represented in matrix form, with the rows and columns identifying each legislator-term observation and the cells identifying the frequency of voting together in that term. The network/graph is therefore undirected: the ordering of a pair does not matter, because both members have the same number of votes in agreement with one another.
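To make the matrix representation concrete, here is a small sketch that turns a symmetric agreement matrix into an undirected, weighted igraph object. The four legislators and their agreement counts are invented for illustration and are not taken from the article's data.

```r
library(igraph)

# Hypothetical agreement counts for four legislators in a single term
legislators <- c("A", "B", "C", "D")
agreement <- matrix(
  c(  0, 410, 120,  95,
    410,   0, 130, 110,
    120, 130,   0, 395,
     95, 110, 395,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(legislators, legislators)
)

# A symmetric matrix implies an undirected network
isSymmetric(agreement)

# Each non-zero cell becomes an edge; the cell value becomes the edge weight
g <- graph_from_adjacency_matrix(
  agreement,
  mode = "undirected",
  weighted = TRUE,
  diag = FALSE
)

ecount(g)      # one edge per pair of legislators
E(g)$weight    # number of votes in agreement for each pair
```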
Pairs are defined as either cross-party (comprised of a single Republican and a single Democrat) or same-party (comprised of two Democrats or two Republicans). We can then think of partisan affiliation as an attribute of each node, and the cross- or same-party pairing being an attribute of each edge.
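One way to sketch this in igraph is to store party as a vertex attribute and derive the pair type as an edge attribute. The legislators, party labels, and the attribute name pair_type below are hypothetical.

```r
library(igraph)

# Hypothetical legislators (nodes) and agreement pairs (edges)
nodes <- data.frame(
  name  = c("rep1", "rep2", "rep3", "rep4"),
  party = c("Democrat", "Democrat", "Republican", "Republican"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("rep1", "rep1", "rep2", "rep3"),
  to   = c("rep2", "rep3", "rep4", "rep4"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Party affiliation is an attribute of each node
V(g)$party

# Pair type is an attribute of each edge, derived from the two endpoints
party_lookup <- setNames(nodes$party, nodes$name)
endpoints <- ends(g, E(g))  # two-column matrix of vertex names
E(g)$pair_type <- ifelse(
  party_lookup[endpoints[, 1]] == party_lookup[endpoints[, 2]],
  "same-party",
  "cross-party"
)

table(E(g)$pair_type)
```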
This figure from the article is a fairly typical visualization of network data known as a node-edge diagram. Nodes are drawn as point marks and the edges connecting them are drawn as line marks. Drawing node-edge diagrams is tricky for the simple reason that you need to determine where to draw each node on a two-dimensional coordinate plane. The problem is that there is no inherent or meaningful value of x-y coordinates. In this data, there is no attribute/variable that tells us where to draw these nodes.
Instead, node-edge diagrams rely on algorithms to determine spatial positioning in the visualization. There are many algorithms that can perform this task, but they typically consider connectivity and distance. Distance is defined as the number of edges along the shortest path connecting two nodes. Nodes that are tightly connected by short distances would therefore be grouped closer together in the visualization. Nodes that are only loosely connected, with many edges along the shortest path between them, should be farther apart on the graph.
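The distance definition above can be checked on a small example. This sketch uses an arbitrary five-node graph and igraph's distances() function, which counts edges along the shortest path when no edge weights are present.

```r
library(igraph)

# Small undirected graph: a chain 1-2-3-4-5 plus a shortcut between 1 and 3
g <- make_graph(
  edges = c(1, 2,  2, 3,  3, 4,  4, 5,  1, 3),
  directed = FALSE
)

# Matrix of shortest-path lengths between every pair of nodes
d <- distances(g)

d[1, 3]  # directly connected through the shortcut
d[1, 5]  # shortest path is 1-3-4-5
d[2, 5]  # shortest path is 2-3-4-5
```

A layout algorithm that respects distance would tend to draw nodes 1 and 3 close together and nodes 1 and 5 farther apart.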
One of the most common network layouts is force-directed placement.
In one variant, network elements are positioned according to a simulation of physical forces where nodes push away from each other while links act like springs that draw the endpoint nodes closer to each other. Typically this method places nodes randomly within the spatial region and iteratively refines the locations by gradually shifting nodes around until the layout improves and stabilizes.
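The procedure just described can be sketched as a toy simulation in base R. This is an illustration of the idea only, not the algorithm used in the article or in any particular package: the function name force_layout, the force formulas (repulsion proportional to 1/distance, spring pull proportional to distance), and the step and max_move parameters are all assumptions chosen for simplicity.

```r
# Toy force-directed layout (illustrative; production algorithms such as
# Fruchterman-Reingold add cooling schedules and other refinements)
force_layout <- function(edges, n, iterations = 300, step = 0.05, max_move = 0.1) {
  pos <- matrix(runif(n * 2), ncol = 2)  # random starting positions
  for (iter in seq_len(iterations)) {
    force <- matrix(0, nrow = n, ncol = 2)
    # Repulsion: every pair of nodes pushes apart, more strongly when close
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        delta <- pos[i, ] - pos[j, ]
        d <- max(sqrt(sum(delta^2)), 1e-6)
        push <- delta / d^2
        force[i, ] <- force[i, ] + push
        force[j, ] <- force[j, ] - push
      }
    }
    # Attraction: each edge acts like a spring pulling its endpoints together
    for (e in seq_len(nrow(edges))) {
      a <- edges[e, 1]
      b <- edges[e, 2]
      delta <- pos[a, ] - pos[b, ]
      force[a, ] <- force[a, ] - delta
      force[b, ] <- force[b, ] + delta
    }
    # Shift each node a small, capped distance in the direction of its net force
    move <- step * force
    len <- pmax(sqrt(rowSums(move^2)), 1e-12)
    pos <- pos + move * pmin(1, max_move / len)
  }
  pos
}

# Two triangles (1-2-3 and 4-5-6) joined by a single edge between 3 and 4
set.seed(1234)
edges <- matrix(c(1, 2,  2, 3,  1, 3,  3, 4,  4, 5,  5, 6,  4, 6),
                ncol = 2, byrow = TRUE)
layout <- force_layout(edges, n = 6)
layout
```

The returned matrix has one row of x-y coordinates per node. In this toy model, two nodes joined by a single edge settle about one unit apart, the point at which the spring pull and the repulsion balance.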
Under this method, absolute spatial position does not encode any meaningful value. The algorithm is designed to minimize distracting artifacts that might confuse the viewer (e.g. edge crossings, node overlaps), so spatial location is merely a side effect. While absolute position is meaningless, relative spatial location can be meaningful. Tightly interconnected groups of nodes should be drawn relatively close together, which could indicate a substantive clustering. However this could also be an artifact of the algorithm. Alternative measures such as centrality are more robust measures of relative node importance in a network.
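As a brief illustration of centrality, the sketch below computes two common measures with igraph on a star-shaped graph, where one hub is connected to every other node. The choice of graph and of these two measures is purely illustrative.

```r
library(igraph)

# Star graph: node 1 is the hub, nodes 2 to 5 connect only to the hub
g <- make_star(5, mode = "undirected")

# Degree centrality: number of edges attached to each node
deg <- degree(g)

# Betweenness centrality: number of shortest paths passing through each node
btw <- betweenness(g)

data.frame(node = seq_len(vcount(g)), degree = deg, betweenness = btw)
```

The hub has degree 4 and betweenness 6 (every shortest path between two outer nodes passes through it), while each outer node has degree 1 and betweenness 0. These values do not depend on where the layout algorithm happens to draw the nodes.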
Because these algorithms start with random placement, in order to exactly reproduce a network visualization you should remember to set your random number seed at the beginning of the script (set.seed() in R).
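A short sketch of why the seed matters, using igraph's Fruchterman-Reingold layout on an arbitrary ring graph. The seed value and the graph are placeholders.

```r
library(igraph)

g <- make_ring(10)

# Same seed before each call: the two layouts should match
set.seed(1234)
layout_a <- layout_with_fr(g)
set.seed(1234)
layout_b <- layout_with_fr(g)
isTRUE(all.equal(layout_a, layout_b))

# No seed reset: the layout will generally differ from the first two
layout_c <- layout_with_fr(g)
isTRUE(all.equal(layout_a, layout_c))
```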
In this graph, edges are drawn only between legislators whose level of agreement exceeds a threshold specific to each Congress, defined as the average level of cooperation within that Congress. If the authors did not do this, every pair with at least one vote in agreement would have an edge drawn, making the graph even more complex.
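The thresholding step might look something like the following sketch. The agreement counts are invented, and taking the mean over all pairs is an assumption about how "average level of cooperation" is computed; the article may define it differently.

```r
library(igraph)

# Hypothetical agreement counts for four legislators in one Congress
legislators <- c("A", "B", "C", "D")
agreement <- matrix(
  c(  0, 410, 120,  95,
    410,   0, 130, 110,
    120, 130,   0, 395,
     95, 110, 395,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(legislators, legislators)
)

# Assumed threshold: mean agreement across all distinct pairs
threshold <- mean(agreement[upper.tri(agreement)])

# Keep an edge only where agreement exceeds the threshold
above <- (agreement > threshold) * 1
g <- graph_from_adjacency_matrix(above, mode = "undirected", diag = FALSE)

threshold
ecount(g)             # edges remaining after thresholding
as_edgelist(g)        # which pairs survived
```

Of the six possible pairs here, only the two with above-average agreement keep an edge, which is what thins out the diagram.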
We also see that the authors encode additional information in the visualization through extra channels:
Hairball diagrams are those with so many nodes and edges that the diagram becomes a jumbled mess and interpretation becomes extremely difficult. A general rule of thumb is that if the number of edges is more than 4 times the number of nodes, straightforward force-directed placement will not be optimal.
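A quick way to screen for a likely hairball is to compare the edge count against the node count. In the sketch below, the helper is_hairball_risk() and its default ratio of 4 edges per node are assumptions for illustration, not part of igraph.

```r
library(igraph)

# Hypothetical helper: flags graphs with more than `ratio` edges per node
is_hairball_risk <- function(g, ratio = 4) {
  ecount(g) > ratio * vcount(g)
}

sparse <- make_ring(20)        # 20 nodes, 20 edges
dense  <- make_full_graph(20)  # 20 nodes, every pair connected

c(nodes = vcount(sparse), edges = ecount(sparse))
is_hairball_risk(sparse)

c(nodes = vcount(dense), edges = ecount(dense))
is_hairball_risk(dense)
```

A check like this is only a rough screen. A graph that fails it may still be readable after thresholding or filtering edges.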
In this example, the individual network diagrams are pretty bad if you want to interpret them at the legislator level. For instance, consider the 2011 graph:
Trying to track all the nodes and edges to identify legislators with the most ties (cooperative pairings) is downright impossible. There is just too much going on here. However, the visualization is very good at depicting the increasing partisanship in the U.S. House of Representatives. Democrats and Republicans are clustered together on the graph (partially the algorithm and partially the fact that legislators vote most frequently with members of their own party). It is easy to see that in the 1950s and 60s there were a lot of edges connecting legislators from both sides of the aisle. In fact we can even see more mixing of the nodes, where Republicans and Democrats are drawn more closely together on the grid. Over time, we can see both the number of cross-party edges decreasing and the spatial distance between the core Democrat and Republican clusters increasing, both outcomes of increased partisanship and decreasing cooperation.
devtools::session_info()
## setting value
## version R version 3.3.3 (2017-03-06)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2017-05-12
##
## package * version date source
## assertthat 0.2.0 2017-04-11 cran (@0.2.0)
## backports 1.0.5 2017-01-18 CRAN (R 3.3.2)
## base * 3.3.3 2017-03-07 local
## broom * 0.4.2 2017-02-13 CRAN (R 3.3.2)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.3.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2)
## datasets * 3.3.3 2017-03-07 local
## DBI 0.6-1 2017-04-01 CRAN (R 3.3.2)
## devtools 1.13.0 2017-05-08 CRAN (R 3.3.2)
## digest 0.6.12 2017-01-27 CRAN (R 3.3.2)
## dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.0)
## evaluate 0.10 2016-10-11 CRAN (R 3.3.0)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.3.2)
## foreign 0.8-68 2017-04-24 CRAN (R 3.3.2)
## ggforce 0.1.1 2016-11-28 CRAN (R 3.3.2)
## ggplot2 * 2.2.1.9000 2017-05-12 Github (tidyverse/ggplot2@f4398b6)
## ggraph * 1.0.0 2017-02-24 CRAN (R 3.3.2)
## ggrepel 0.6.5 2016-11-24 CRAN (R 3.3.2)
## graphics * 3.3.3 2017-03-07 local
## grDevices * 3.3.3 2017-03-07 local
## grid 3.3.3 2017-03-07 local
## gridExtra 2.2.1 2016-02-29 cran (@2.2.1)
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
## haven 1.0.0 2016-09-23 cran (@1.0.0)
## hms 0.3 2016-11-22 CRAN (R 3.3.2)
## htmltools 0.3.6 2017-04-28 cran (@0.3.6)
## httr 1.2.1 2016-07-03 CRAN (R 3.3.0)
## igraph * 1.0.1 2015-06-26 CRAN (R 3.3.0)
## jsonlite 1.4 2017-04-08 cran (@1.4)
## knitr * 1.15.1 2016-11-22 cran (@1.15.1)
## lattice 0.20-35 2017-03-25 CRAN (R 3.3.2)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
## lubridate 1.6.0 2016-09-13 CRAN (R 3.3.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## MASS 7.3-47 2017-04-21 CRAN (R 3.3.2)
## memoise 1.1.0 2017-04-21 CRAN (R 3.3.2)
## methods * 3.3.3 2017-03-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.3.0)
## modelr * 0.1.0 2016-08-31 CRAN (R 3.3.0)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.3.3)
## parallel 3.3.3 2017-03-07 local
## plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
## psych 1.7.5 2017-05-03 CRAN (R 3.3.3)
## purrr * 0.2.2.2 2017-05-11 CRAN (R 3.3.3)
## R6 2.2.1 2017-05-10 CRAN (R 3.3.2)
## Rcpp 0.12.10 2017-03-19 cran (@0.12.10)
## readr * 1.1.0 2017-03-22 cran (@1.1.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.3.2)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.3.0)
## rlang 0.1.9000 2017-05-12 Github (hadley/rlang@c17568e)
## rmarkdown 1.5 2017-04-26 CRAN (R 3.3.2)
## rprojroot 1.2 2017-01-16 CRAN (R 3.3.2)
## rvest 0.3.2 2016-06-17 CRAN (R 3.3.0)
## scales 0.4.1 2016-11-09 CRAN (R 3.3.1)
## stats * 3.3.3 2017-03-07 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.3.2)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.3.2)
## tibble * 1.3.0.9002 2017-05-12 Github (tidyverse/tibble@9103a30)
## tidyr * 0.6.2 2017-05-04 CRAN (R 3.3.2)
## tidyverse * 1.1.1 2017-01-27 CRAN (R 3.3.2)
## tools 3.3.3 2017-03-07 local
## tweenr 0.1.5 2016-10-10 CRAN (R 3.3.0)
## udunits2 0.13 2016-11-17 CRAN (R 3.3.2)
## units 0.4-4 2017-04-20 CRAN (R 3.3.2)
## utils * 3.3.3 2017-03-07 local
## viridis 0.4.0 2017-03-27 CRAN (R 3.3.2)
## viridisLite 0.2.0 2017-03-24 cran (@0.2.0)
## withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
## xml2 1.1.1 2017-01-24 CRAN (R 3.3.2)
## yaml 2.1.14 2016-11-12 cran (@2.1.14)